machine learning classifier
Performance of Machine Learning Classifiers for Anomaly Detection in Cyber Security Applications
This work empirically evaluates machine learning models on two imbalanced public datasets (KDDCUP99 and Credit Card Fraud 2013). The method includes data preparation, model training, and evaluation, using an 80/20 (train/test) split. Models tested include eXtreme Gradient Boosting (XGB), Multi Layer Perceptron (MLP), Generative Adversarial Network (GAN), Variational Autoencoder (VAE), and Multiple-Objective Generative Adversarial Active Learning (MO-GAAL), with XGB and MLP further combined with Random-Over-Sampling (ROS) and Self-Paced-Ensemble (SPE). Evaluation involves 5-fold cross-validation and imputation techniques (mean, median, and IterativeImputer) with 10, 20, 30, and 50% missing data. Findings show that XGB and MLP outperform the generative models. IterativeImputer gives results comparable to mean and median imputation, but it is not recommended for large datasets due to its increased complexity and execution time. The code used is publicly available on GitHub (github.com/markushaug/acr-25).
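The missing-data setup described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names `mask_values` and `impute` and the toy column are hypothetical, and only mean and median imputation are shown (IterativeImputer is left out to keep the example dependency-free).

```python
import random
import statistics


def mask_values(column, missing_fraction, seed=0):
    """Return a copy of `column` with a fraction of entries replaced by None."""
    rng = random.Random(seed)
    n_missing = round(len(column) * missing_fraction)
    missing_idx = set(rng.sample(range(len(column)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(column)]


def impute(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


# Hypothetical feature column containing one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 2.5, 3.5, 1.5, 2.0, 3.0]

# Same missing-data rates as in the abstract: 10, 20, 30, and 50%.
for fraction in (0.1, 0.2, 0.3, 0.5):
    masked = mask_values(values, fraction)
    mean_filled = impute(masked, "mean")
    median_filled = impute(masked, "median")
    print(fraction, masked.count(None), mean_filled[:3], median_filled[:3])
```

With an outlier in the column, the mean fill value is pulled upward while the median stays close to the typical values, which is one reason both strategies are usually compared.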
Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
Hort, Max, Chen, Zhenpeng, Zhang, Jie M., Harman, Mark, Sarro, Federica
This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods.
Prognosis and Treatment Prediction of Type-2 Diabetes Using Deep Neural Network and Machine Learning Classifiers
Kowsher, Md., Turaba, Mahbuba Yesmin, Sajed, Tanvir, Rahman, M M Mahabubur
Type 2 Diabetes is a fast-growing, chronic metabolic disorder due to imbalanced insulin activity. This research is a comparative study of seven machine learning classifiers and an artificial neural network method for predicting the detection and treatment of diabetes with high accuracy, in order to identify and treat diabetes patients at an early age. Our training and test dataset comprises information on 9483 diabetes patients. The training dataset is large enough to negate overfitting and provide for highly accurate test performance. We use performance measures such as accuracy and precision to identify the best algorithm, the deep ANN, which outperforms all other tested machine learning classifiers with 95.14% accuracy. We hope our high-performing model can be used by hospitals to predict diabetes and drive research into more accurate prediction models.
Which Machine Learning Classifiers are Best for Small Datasets?
How applicable are these experiments? Both levels of the nested cross-validation used class-stratified random splits. So the splits were IID: independent and identically distributed. The test data looked like the validation data which looked like the training data. This is both unrealistic and precisely how most peer-reviewed publications evaluate when they try out machine learning.
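A class-stratified random split, as mentioned in this snippet, can be sketched in a few lines. This is an illustrative version under stated assumptions, not the article's code: the function name `stratified_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)  # deal out round-robin
    return folds


# Hypothetical imbalanced label vector: 20% positive, 80% negative.
labels = ["pos"] * 10 + ["neg"] * 40
folds = stratified_folds(labels, n_folds=5)

for k, test_idx in enumerate(folds):
    n_pos = sum(labels[i] == "pos" for i in test_idx)
    print(f"fold {k}: {len(test_idx)} samples, {n_pos} positive")
```

Because every fold is drawn from the same pool with the same class ratio, the held-out data is distributed like the training data, which is exactly the IID property the snippet calls unrealistic.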
Improving Costs and Robustness of Machine Learning Classifiers Against Adversarial Attacks via Self Play of Repeated Bayesian Games
Dasgupta, Prithviraj (U.S. Naval Research Laboratory) | Collins, Joseph B. (U.S. Naval Research Laboratory) | McCarrick, Michael (U.S. Naval Research Laboratory)
We consider the problem of adversarial machine learning where an adversary performs evasion attacks on a classifier-based learner by sending it queries with adversarial data of different attack strengths. The learner does not know whether a query sent to it is clean or adversarial. The objective of the learner is to mitigate the adversary's attacks by reducing its classification errors on adversarial data. To address this problem, we propose a technique where the learner maintains multiple classifiers that are trained with clean as well as adversarial data of different attack strengths. We then describe a game theoretic framework based on a 2-player repeated Bayesian game, called Repeated Bayesian Sequential Game (RBSG) with self play, that enables the learner to determine an appropriate classifier to deploy so that the likelihood of correctly classifying the query and preventing the evasion attack does not deteriorate, while reducing the costs to deploy the classifiers. Experimental results of our proposed approach with adversarial text data show that our RBSG with self play-based technique maintains classifier accuracies comparable with that of an individual, powerful and costly classifier, while strategically using multiple, lower cost but less powerful classifiers to reduce the overall classification costs.
Improving Mechanical Ventilator Clinical Decision Support Systems with A Machine Learning Classifier for Determining Ventilator Mode
Rehm, Gregory B., Kuhn, Brooks T., Nguyen, Jimmy, Anderson, Nicholas R., Chuah, Chen-Nee, Adams, Jason Y.
Clinical decision support systems (CDSS) will play an increasing role in improving the quality of medical care for critically ill patients. However, due to limitations in current informatics infrastructure, CDSS do not always have complete information on the state of supporting physiologic monitoring devices, which can limit the input data available to CDSS. This is especially true in the use case of mechanical ventilation (MV), where current CDSS have no knowledge of critical ventilation settings, such as ventilation mode. To enable MV CDSS to make accurate recommendations related to ventilator mode, we developed a highly performant machine learning model that is able to perform per-breath classification of 5 of the most widely used ventilation modes in the USA with an average F1-score of 97.52%. We also show how our approach makes methodologic improvements over previous work and that it is highly robust to missing data caused by software/sensor error.
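For readers unfamiliar with the metric, an average F1-score over several classes can be computed as sketched below. The abstract does not say which averaging was used, so the unweighted (macro) average here is an assumption, and the class names and labels are made up for illustration.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class to its F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical per-breath labels for two made-up modes.
y_true = ["mode_a", "mode_a", "mode_b", "mode_b"]
y_pred = ["mode_a", "mode_b", "mode_b", "mode_b"]

per_class = f1_per_class(y_true, y_pred)
average = macro_f1(y_true, y_pred)
print(per_class, average)
```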
Creating Your First Machine Learning Classifier with Sklearn
This article was written by Kasper Fredenslund. Once we have downloaded the data, the first thing we want to do is to load it in and inspect its structure. For this we will use pandas. Pandas is a Python library that gives us a common interface for data processing called a DataFrame. DataFrames are essentially Excel spreadsheets with rows and columns, but without the fancy UI Excel offers.
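Loading data into a DataFrame and inspecting its structure might look like the following sketch. The inline CSV text and its column names are hypothetical stand-ins for the downloaded file, which is not included in this snippet.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
4.9,3.0,1.4,0.2,setosa
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                     # first rows, like the top of a spreadsheet
print(df.shape)                      # (number of rows, number of columns)
print(df.dtypes)                     # column types inferred by pandas
print(df["species"].value_counts())  # how many rows per label
```

With a real file, `io.StringIO(csv_text)` would be replaced by the path to the downloaded CSV.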
Making your First Machine Learning Classifier in Scikit-learn (Python) Codementor
One of the most amazing things about Python's scikit-learn library is that it has a 4-step modeling pattern that makes it easy to code a machine learning classifier. While this tutorial uses a classifier called Logistic Regression, the coding process in this tutorial applies to other classifiers in sklearn (Decision Tree, K-Nearest Neighbors etc). In this tutorial, we use Logistic Regression to predict digit labels based on images. The image above shows a bunch of training digits (observations) from the MNIST dataset whose category membership is known (labels 0–9). After training a model with logistic regression, it can be used to predict an image label (labels 0–9) given an image. The first part of this tutorial post goes over a toy dataset (digits dataset) to quickly illustrate scikit-learn's 4-step modeling pattern and show the behavior of the logistic regression algorithm.
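The 4-step pattern is not spelled out in this snippet; a common reading of it (import the model, create an instance, fit, predict) can be sketched on the digits toy dataset as follows. The split ratio, `random_state`, and `max_iter` values are assumptions chosen for the example, not taken from the tutorial.

```python
# Step 1: import the model class (plus helpers for data and splitting).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 digit images flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Step 2: create an instance of the model.
# max_iter is raised so the solver has room to converge on unscaled pixels.
model = LogisticRegression(max_iter=5000)

# Step 3: train the model on the labeled training data.
model.fit(X_train, y_train)

# Step 4: predict labels for unseen images.
predictions = model.predict(X_test)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Swapping in another classifier such as `DecisionTreeClassifier` or `KNeighborsClassifier` changes only the import and the instantiation line; the fit and predict calls stay the same.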
Creating Your First Machine Learning Classifier with Sklearn
But you don't know where to start, or perhaps you have read some theory, but don't know how to implement what you have learned. This tutorial will help you break the ice, and walk you through the complete process from importing and analysing a dataset to implementing and training a few different well-known classification algorithms and assessing their performance. I'll be using a minimal amount of discrete mathematics, and aim to express details using intuition and concrete examples instead of dense mathematical formulas. We will be classifying flower species based on their sepal and petal characteristics using the Iris flower dataset, which you can download from Kaggle. Kaggle, if you haven't heard of it, has a ton of cool open datasets, and is a place where data scientists share their work, which can be a valuable resource when learning.
Choosing a Machine Learning Classifier
How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you're simply looking for a "good enough" algorithm for your problem, or a place to start, here are some general guidelines I've found to work well over the years. If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
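Selecting among a few candidates by cross-validation could look like the sketch below. The dataset (Iris), the candidate list, and the values of `n_neighbors` are assumptions made for illustration; the post itself names only Naive Bayes and kNN as examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A high bias/low variance model and a low bias/high variance model,
# the latter tried with several parameter settings.
candidates = {
    "GaussianNB": GaussianNB(),
    "kNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "kNN (k=15)": KNeighborsClassifier(n_neighbors=15),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

best = max(mean_scores, key=mean_scores.get)
for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
print("selected:", best)
```

On a dataset this small the differences between candidates are often within noise, so the selected model should be read as a starting point rather than a final answer.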